Getting a Data Science Job Using Data Science

A Presentation by Melanie Desroches

Background

Data Science Jobs

  • As the collection of data becomes more important across a variety of industries, the need for trained data scientists as increased.
  • According to the US Bureau of Labor Statistics, there has been a 36% growth in data science jobs (compared to the average job growth of 4%).
  • What does it take to get a data science related job?
  • What are the skills that are valued in these positions?

Research Question

What skills are important for different data science jobs?

  • What are the most important skills for a job as a data scientist?
  • What are the most important skills for a job as a data analyst?
  • What are the most important skills for a job as a machine learning engineer?
  • What are the most important skills for a job as a data engineer?

Data

Data

  • The dataset consists of 660 job postings related to data science
  • The data set was collected from Kaggle
  • The following is the summary of the job titles:
    • data scientist 447
    • other 68
    • analyst 55
    • data engineer 46
    • mle 34

Feature Engineering

  • To answer the research question, feature engineering:
    • Identified prevalent skills in the job descriptions.
    • Created boolean columns indicating whether a skill was mentioned.
  • The identified skills include:
    • Programming Skills: Python, SQL, Java, C++, Scala, R
    • Big Data Tools: Hadoop, Spark, AWS, Git, Linux, Azure
    • Visualization Tools: Tableau, Matlab
    • Soft Skills: Communication, leadership, collaboration, problem-solving, critical thinking

Methods

Random Forests

  • A Random Forest classifier was used to analyze the data.
  • To handle class imbalance, I used class weights within the model.
  • The model evaluated the presence of skills against job titles.
  • From the random forest, feature importance can identify the most important skills that predicted the job type
    • SHAP (SHapley Additive exPlanations) values: measure the impact of each feature on the model’s predictions.
    • Gini Importance: consideres the overall reduction in impurity caused by each feature

Results

Results for the Overall Model

Feature Importance for All Roles

Results for Analyst Jobs

Feature Importance for Analyst Roles

Results for Data Engineer Jobs

Feature Importance for Data Engineer Roles

Results for Data Scientist Jobs

Feature Importance for Data Scientist Roles

Results for Machine Learning Engineer Jobs

Feature Importance for Machine Learning Engineer Roles

Evaluating the Model

Overall, the model achieved an accuracy of 76.9%.

Summary of the Classification Report

  • Analyst: Precision and recall were both low (precision: 0.71, recall: 0.38).
  • Data Engineer: Precision was good (0.80), but recall was poor (0.36), suggesting the model was selective but missed many true instances of this class.
  • Data Scientist: This was the best-performing class, with high precision (0.79), recall (0.92), and F1-score (0.85).
  • MLE: Performance was moderate, with a precision of 0.75 and a recall of 0.43, leading to an F1-score of 0.55.
  • Other: The model performed equally in precision and recall (0.60 for both), with a balanced F1-score of 0.60.

Discussion

Possible Limitations

  • The data set on Kaggle is listed at about 4 years old, which may make the data not as relevant to today’s job market
  • I used a very naive approach to the feature engineering so certain features may have been missed.
  • Due to the heavy imbalance in the dataset, it was hard to improve the model for more accurate results.

What did we learn?

  • While there is a lot of overlap between different types of data science jobs, there is also certain skills that make each unique
  • Python, SQL, Communication are important for all of the jobs
  • Certain soft skills are more valuable to certain jobs (for example, data scientists and collaboration, analysts and teamwork, data engineers and problem-solving, machine learning and leadership)

Thank you